##Goodreads Book Review Dataset
#Goodreads is a website used for reviewing books
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 0.5.2
## Warning: package 'readr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(DT)
## Warning: package 'DT' was built under R version 4.2.3
library(plotly) #this imports some of the functions I used
goodreads <- read.csv("C:\\Users\\Samantha\\Desktop\\Past Classes\\2023 Spring Semester\\DSCI 101\\archive\\books.csv")
goodreads #here you can see every book on Goodreads in the order they were added
The Goodreads dataset was added to Kaggle on May 25, 2019 as a way of interacting with data gathered from the Goodreads API. Because Goodreads stopped offering this data on December 8th, 2020, the dataset contains no information on books added after that date. For each book, it records the publication date, publisher name, title, author(s), language, ISBN codes, number of pages, number of text reviews, number of ratings, and average rating from Goodreads users.
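#pull the four-digit year out of each publication date; dates that fail to parse become NA
byyear <- goodreads %>% mutate(year = format(as.Date(publication_date, "%m/%d/%Y"), format = "%Y"))
#count books per year, then drop the NA group (filter(na.rm = TRUE) ran but did not remove it)
byyear_ <- byyear %>% group_by(year) %>% summarise(booksnumber = n()) %>% filter(!is.na(year))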
plot_byyear3 <- ggplot(data = byyear_, aes(x = year, y = booksnumber)) +
  geom_bar(stat = "identity", fill = "blue4") +
  ggtitle("Figure 1: Books Added To Goodreads Per Year",
          subtitle = "Goodreads was founded December 2006") +
  xlab("Year Since 1900") +
  ylab("Number of Books") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
        plot.subtitle = element_text(color = "grey", face = "italic"))
ggplotly(plot_byyear3)
When perusing the books stored in Goodreads, it’s easy to feel like there is an even spread of new releases and established titles. However, the actual catalogue of books reviewable on Goodreads skews heavily toward books that were in recent memory when the site was founded. The number of books per publication year grows roughly exponentially up to the year of the site’s founding, then drops by more than half during the site’s first year of existence. Though the dates on which books were actually added to Goodreads are not stored, it is likely that most of the catalogue was added before the site launched. Hover over the plot to see the exact number of books per year.
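The year extraction behind Figure 1 can be sketched on a few made-up rows. This is a minimal base R illustration, not the report's own pipeline: `toy_books` and its values are hypothetical stand-ins for `books.csv`, and the third date is deliberately invalid to show how an unparseable date turns into an `NA` year that has to be dropped before counting.

```r
# Hypothetical stand-in for the publication_date column of books.csv
toy_books <- data.frame(
  title = c("A", "B", "C", "D"),
  publication_date = c("9/16/2006", "11/1/2003", "11/31/2000", "5/1/2003"),
  stringsAsFactors = FALSE
)

# Parse month/day/year strings; impossible dates (November 31) become NA
parsed <- as.Date(toy_books$publication_date, format = "%m/%d/%Y")
toy_books$year <- format(parsed, "%Y")

# Drop rows whose year could not be parsed, then count books per year
valid <- toy_books[!is.na(toy_books$year), ]
books_per_year <- as.data.frame(
  table(year = valid$year),
  responseName = "booksnumber",
  stringsAsFactors = FALSE
)
print(books_per_year)
```

Checking `sum(is.na(parsed))` on the real data is a quick way to see how many rows fall out at this step.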
reviewratio <- goodreads %>%
  mutate(average_rating = as.numeric(average_rating)) %>%
  #keep books with more than 20 ratings and 20 text reviews, and drop ratings that failed to parse
  filter(text_reviews_count > 20, ratings_count > 20, !is.na(average_rating)) %>%
  mutate(ratings_reviews = (text_reviews_count / ratings_count)) #ratio of text reviews to ratings
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
ggplot(aes(x = ratings_reviews, y = average_rating, color = text_reviews_count), data = reviewratio) +
  geom_point() +
  scale_color_gradient(low = "lightblue", high = "black") +
  ggtitle("Figure 2: Book Ratings On Goodreads",
          subtitle = "Are users most likely to write reviews for books they enjoyed?") +
  xlab("Text Reviews to Ratings Ratio") +
  ylab("Rating (in stars out of 5)") +
  labs(color = "Number of Text Reviews") +
  theme(plot.subtitle = element_text(color = "grey", face = "italic"))
The ten books with the largest ratio of text reviews to star ratings are obscure, with fewer than two reviews and ratings each. They are also mostly in languages other than English, which is notable because most of the books on Goodreads are in English. The ten books with the smallest ratio of text reviews to star ratings are several Wrinkle In Time reprints, with around 15 ratings and 0 reviews each. To remove these results, I dropped all entries with 20 or fewer ratings or text reviews.
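The effect of that threshold can be sketched with a few invented rows. This is a base R illustration under assumed data: the titles and counts in `toy_books` are hypothetical, while the column names mirror the ones used in the report.

```r
# Hypothetical rows; the counts are made up for illustration
toy_books <- data.frame(
  title = c("Obscure A", "Popular B", "Popular C", "Unrated D"),
  ratings_count = c(1, 5000, 200, 0),
  text_reviews_count = c(1, 100, 50, 0),
  stringsAsFactors = FALSE
)

# Text reviews per rating, the quantity on the x-axis of Figure 2
toy_books$ratings_reviews <- toy_books$text_reviews_count / toy_books$ratings_count

# Without a threshold, a book with one rating and one review tops the list
unfiltered_top <- toy_books$title[which.max(toy_books$ratings_reviews)]

# Keep only books with more than 20 ratings and more than 20 text reviews
keep <- toy_books$text_reviews_count > 20 & toy_books$ratings_count > 20
filtered <- toy_books[keep, ]
filtered <- filtered[order(-filtered$ratings_reviews), ]
print(filtered)
```

Note that the unrated book produces `0 / 0`, which R reports as `NaN`; the threshold removes it along with the tiny-sample rows.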
I expected that the books with the highest number of text reviews would be those rated either very high or very low, as people enjoy talking about things they love and things they hate. However, Figure 2 shows that the books with the most text reviews relative to ratings are those that have not been rated very many times at all.
The very visible black dot with a rating of 3.59 and a text-reviews-to-ratings ratio of 0.02 is the first book in the Twilight series. Its large number of text reviews is still far smaller than its number of ratings. The higher-rated black dot is The Book Thief, and the greyer dots are The Giver by Lois Lowry, Paulo Coelho’s The Alchemist, Water for Elephants, the first Percy Jackson book, Eat Pray Love, The Glass Castle, The Catcher in the Rye, and the third Harry Potter book. These books inspire many people to talk about them, but inspire even more people to only rate them.
reviewratio %>% arrange(-text_reviews_count) #Fig 2a, the table of books with the highest number of text reviews
Note that Twilight’s first book was the 41,865th book to be added to Goodreads. People usually leave text reviews for the first book of a series, but the third Harry Potter book is an exception. From my experience on the website, I would assume they are posting excited gifsets (a time-honored Goodreads tradition) for Sirius Black. Otherwise, many of these books are known for being impactful, and the reviews could contain people talking about their experiences with the book.
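prolific_pub <- byyear %>%
  mutate(average_rating = as.numeric(average_rating)) %>%
  filter(!is.na(average_rating)) %>% #drop ratings that failed to parse so the means are not NA
  group_by(publisher) %>%
  summarise(num = n(), avg_book_rating = mean(average_rating)) %>%
  arrange(-num) %>%
  head(20) #keep the 20 publishers with the most books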
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
prolificplot <- ggplot(aes(x = num, y = publisher, fill = avg_book_rating), data = prolific_pub) +
  geom_bar(stat = "identity") +
  ggtitle("Figure 3: Books Per Publisher",
          subtitle = "Hover to see average book rating across publishing company") +
  xlab("Books Per Publisher") +
  ylab("Publisher Names") +
  theme(plot.subtitle = element_text(color = "grey", face = "italic"))
ggplotly(prolificplot)
By far, the single publisher with the most books on Goodreads is Vintage. This may be because other publishing houses split the books they publish under different names, such as Oxford University Press and Oxford University Press USA, or HarperCollins and Harper Perennial. Vintage Books was established in 1954 and published “Guns, Germs, and Steel” and Paulo Coelho’s works. It is not an independent publisher, though; it is an imprint of Penguin Random House.
This plot can be hovered over to see the actual average book rating per publisher.
well_rated <- goodreads %>%
  mutate(average_rating = as.numeric(average_rating)) %>%
  #drop ratings that failed to parse; also requiring text_reviews_count > 10 here leaves Stephen King with only 35 books
  filter(!is.na(average_rating)) %>%
  group_by(authors) %>%
  summarise(avg_rating = mean(average_rating), number_of_books = n()) %>%
  arrange(-number_of_books)
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
ratedauthors <- ggplot(aes(x = avg_rating, y = number_of_books, text = authors), data = well_rated) +
  geom_point() + #the text aesthetic above supplies the author name on hover
  ggtitle("Figure 4: Book Quantity Versus Quality",
          subtitle = "Hover to see author name") +
  xlab("Average Rating of an Author's Body of Work") +
  ylab("Number of Books") +
  theme(plot.subtitle = element_text(color = "grey", face = "italic"))
ggplotly(ratedauthors)
Sometimes, quantity outshines quality. Sometimes, authors publish show-stopping work after show-stopping work. You may notice that there are many mangaka on the top end of the scale ratings-wise, such as Hiromu Arakawa (who wrote and illustrated Fullmetal Alchemist, the source material of the #1 anime on myanimelist.net since time immemorial) and Hirohiko Araki (who writes and illustrates Jojo’s Bizarre Adventure). Fans of both works are notoriously dedicated, which could explain why Arakawa’s and Araki’s works are consistently highly rated, even though each translation counts as a different author name in this dataset. Virginia Woolf and Ovid also appear highly rated alongside their translators, abridgers, and editors. In the region of highly rated authors with 10 or so entries, you will find Richard P. Feynman, a brilliant and beloved physicist, and Patrick O’Brian, an influential figure in the genre of books that take place at sea. There is a pattern emerging here: people seem to rate most highly the books that are juggernauts in the author’s niche, but not often talked about outside of it.
It is also important to note that the number of books an author has on Goodreads is not always the number of books they have written. Wikipedia counts 65 Stephen King novels, but he has only 40 entries on Goodreads, or 35 if you remove books which have not been reviewed more than 10 times. As discussed above with the mangaka, multiple editions of a work can inflate the number of books attributed to one author, or split books across different names when translators’ and editors’ credits are included. The number of books is also shown in this figure to flag authors whose average rests on books with only one or two ratings. Two authors have an average rating of 0.0 because they have no ratings, and there are authors at 2.0 and 5.0 who must have only been rated once or twice.
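The split-credit effect described above can be sketched as follows. This is a base R illustration under stated assumptions: the names and ratings in `toy` are hypothetical, and the `/` separator between co-credited names is an assumption about how the `authors` column is formatted, so check the real data before relying on it.

```r
# Hypothetical credit strings; "/" as the separator is an assumption
toy <- data.frame(
  authors = c("Author A", "Author A/Translator X", "Author A/Translator Y", "Author B"),
  average_rating = c(4.0, 4.4, 4.6, 3.5),
  stringsAsFactors = FALSE
)

# Grouping on the raw credit string treats each pairing as a separate author
n_raw_groups <- length(unique(toy$authors))

# Take the first listed name as the primary author
toy$primary_author <- vapply(
  strsplit(toy$authors, "/", fixed = TRUE),
  function(parts) parts[1],
  character(1)
)

# Regroup by primary author: average rating and number of books
avg_rating <- tapply(toy$average_rating, toy$primary_author, mean)
number_of_books <- tapply(toy$average_rating, toy$primary_author, length)
by_primary <- data.frame(
  authors = names(avg_rating),
  avg_rating = as.numeric(avg_rating),
  number_of_books = as.integer(number_of_books),
  stringsAsFactors = FALSE
)
print(by_primary)
```

Taking the first name is a simplification: it merges translated editions under one author, but it also discards legitimate co-authors, so the choice depends on the question being asked.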
This plot can be hovered over to see which author had which rating and number of books.
pagerating <- goodreads %>%
  mutate(average_rating = as.numeric(average_rating)) %>%
  filter(!is.na(average_rating)) %>% #drop ratings that failed to parse
  group_by(num_pages) %>%
  summarise(avg_rating = mean(average_rating), occurances = n()) %>% #occurances = books with this page count
  arrange(-avg_rating)
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
pagerating %>% arrange(-occurances) %>% head(10)
page_rated <- ggplot(aes(y = avg_rating, x = num_pages, text = occurances), data = pagerating) +
  geom_point() + #the text aesthetic above supplies the book count on hover
  ggtitle("Figure 5: Book Quantity Versus Quality, Pages", subtitle = "Does being too wordy drag down a work?") +
  xlab("Number of Pages") +
  ylab("Average Rating") +
  theme(plot.subtitle = element_text(color = "grey", face = "italic"))
ggplotly(page_rated)
This figure repeats the comparison in Figure 4, but for page count rather than number of books! Hover over the plot to see how many books have each length; the table displays the 10 most common page lengths.
booktitles <- goodreads %>% group_by(title) %>%
summarise(num = n()) %>% arrange(-num)
datatable(booktitles, options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
These are the Goodreads titles arranged by number of duplicates. There are 9 entries on Goodreads for different versions of The Brothers Karamazov, and the translations do not have different titles to tell them apart. The table is interactive, and you can search for a book to see whether others share its title.
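The duplicate count behind this table can be sketched in base R. The vector `toy_titles` is a hypothetical stand-in for the `title` column, and the `booktitles` data frame here only mirrors the shape of the one built above.

```r
# Hypothetical titles standing in for the title column of books.csv
toy_titles <- c(
  "The Brothers Karamazov", "Twilight", "The Brothers Karamazov",
  "The Giver", "The Brothers Karamazov", "Twilight"
)

# Count how often each title appears, most frequent first
title_counts <- sort(table(toy_titles), decreasing = TRUE)
booktitles <- data.frame(
  title = names(title_counts),
  num = as.integer(title_counts),
  stringsAsFactors = FALSE
)

# Titles that appear more than once are the candidate duplicate editions
duplicated_only <- booktitles[booktitles$num > 1, ]
print(duplicated_only)
```

Exact string matching only catches editions with identical titles; entries with a subtitle or edition note appended would be counted as separate titles.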
reviewratio %>% arrange(-ratings_reviews) #bonus table A, books with the most reviews relative to ratings
reviewratio %>% arrange(-average_rating) #bonus table B, books with the highest ratings.
Please note how Calvin and Hobbes tops the list of most highly rated books on Goodreads by multiple criteria. As Lemon Demon would say, “Bill Watterson can’t you hear me. Bill Watterson don’t you fear me. Bill Watterson don’t treat me like I have rabies.” and so on.
w_r <- well_rated %>% filter(number_of_books > 5)
w_r %>% arrange(-avg_rating) #bonus table C, Hiromu Arakawa rules all.
JK Rowling and Mary GrandPre are paired here as author and illustrator, much in the way that authors were paired with translators and editors in Figure 4. Tolkien is surprisingly outranked by Harry Potter in both illustrated and unillustrated form. While unfortunate, this is to be expected from a book website that added the Harry Potter series first, out of all of the books in the world. On a lighter note, you will also notice that the best-rated editions of Jojo’s Bizarre Adventure are the ones that list Araki as both writer and illustrator. Bill Watterson again cannot be unseated as the Goodreads author with the genuinely highest ratings. Patrick O’Brian also makes a repeat appearance.
reviewratio %>% arrange(-text_reviews_count) #bonus table D, books with the most reviews